Your task is to read in the input csv file, and use the verbs you have learned this week to get it into a workable dataset for analysis.
The data we’re going to use for this lab comes from a data bank of face-rating data. In this dataset, people were shown faces and asked to rate different characteristics about them. People were free to respond however they wanted, and data was collected from many different people. You can start to imagine that the incoming dataset might be a little messy…
Let’s start by loading the tidyverse library and reading the data in. You learned earlier this week how to get a feel for your data when you first load it in. Find out the dimensions, variable names, and print the first 10 rows.
Answer
library(tidyverse)
# load in the data
face_descriptions <- read_csv("data/face_descriptions.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   user_id = col_double(),
##   age = col_double()
## )
## See spec(...) for full column specifications.
# get dimensions, variable names, and print the first 10 rows
dim(face_descriptions)
## [1] 220 53
names(face_descriptions)
## [1] "user_id" "sex" "age" "t1" "t2" "t3" "t4"
## [8] "t5" "t6" "t7" "t8" "t9" "t10" "t11"
## [15] "t12" "t13" "t14" "t15" "t16" "t17" "t18"
## [22] "t19" "t20" "t21" "t22" "t23" "t24" "t25"
## [29] "t26" "t27" "t28" "t29" "t30" "t31" "t32"
## [36] "t33" "t34" "t35" "t36" "t37" "t38" "t39"
## [43] "t40" "t41" "t42" "t43" "t44" "t45" "t46"
## [50] "t47" "t48" "t49" "t50"
face_descriptions %>%
  head(10)
Do these variable names look tidy to you? What format is your data in (long or wide)?
Answer
If we have a look at these descriptions, we can see that there’s a lot of variation. Some participants provided one word descriptions whilst others provided multiple word descriptions. Some wrote their descriptions in lower case, whilst others wrote in upper case and used punctuation (e.g. commas, hyphens, slashes). As a result, this data isn’t organised in a clear, logical way.
Each row is a unique ID of the person doing the rating. However, the characteristics they rate on are spread across 50 different columns (wide format).
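To see concretely what pivoting from wide to long does, here is a minimal sketch on a made-up tibble of three raters and two faces (toy data, not the lab dataset):

```r
library(tidyverse)

# Toy example: three raters (rows) each describe two faces (columns t1, t2)
wide <- tibble(user_id = 1:3,
               t1 = c("happy", "sad", "angry"),
               t2 = c("calm", "tired", "bored"))

# Pivot the two t columns into face_id / description pairs
long <- wide %>%
  pivot_longer(cols = starts_with("t"),
               names_to = "face_id",
               values_to = "description")

long  # 6 rows: one observation (one rating of one face) per row
```

The same idea scales up to the lab data, where 50 `t` columns become 50 rows per rater.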
Being faced with such complex data can be daunting; we may feel overwhelmed and ask ourselves: where do we even start?
The first problem you can always tackle is to get your data in an appropriate format. What format do you need?
Once you have figured that out, reshape the data into the appropriate format.
Hint: use the pivot_longer() function to gather the data into long format. You’ll need to put all columns beginning with t into one column.
Answer
face_descriptions_long <- face_descriptions %>%
  pivot_longer(cols = starts_with("t"),
               names_to = "face_id",
               values_to = "description")
face_descriptions_long
Now, each row has one observation, meaning that we first have all participants’ descriptions of face t1, followed by all participants’ descriptions of face t2, and so on…
But wait! Some people have given lengthy descriptions of the faces, some have only given one word. This can be problematic for analysis, if you eventually want to summarise key descriptions.
Some people have split their description of a face using a slash. Use the separate function to split apart the description column so that if people have given more than one answer (i.e. if the answers are separated by a /), each answer is moved into its own column.
First, you’ll need to decide a cut off for how many responses you want to accept from people. Is one enough? Two? Three? Once you’ve decided how many descriptions you want to code, you’ll have to separate your description columns into that many columns.
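One way to pick a sensible cutoff is to count the separators. Here is a sketch on a few toy strings (the same idea works on the description column of face_descriptions_long): the number of answers in a description is one more than the number of "/" separators it contains.

```r
library(tidyverse)

# Toy descriptions, not the real data
descriptions <- c("happy", "sad/tired", "kind/warm/open", "angry")

# Answers per description = separators + 1
pieces <- str_count(descriptions, "/") + 1

# Tabulate how many answers people tend to give
table(pieces)
```

Running this on the real data would tell you how rare four-or-more-answer responses actually are before you commit to a cutoff.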
Answer
# I had a look through the data, and decided that three was the most common response if they'd gone beyond one description. Some people gave four, but not many. So I'll stick with three... you might have something different.
# separate it
separated_descriptions <- face_descriptions_long %>%
  separate(description, c("d1","d2","d3"), sep = "/")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 10959
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
separated_descriptions
Alternative answer - cSplit
There is also another handy function, cSplit from the splitstackshape package, which performs similarly to separate but doesn’t require you to specify how many columns the data should be split into.
library(splitstackshape)
face_descriptions_long %>%
  cSplit("description", sep = "/")
You can even use the argument direction = "long", which means you don’t need an extra step to make the data long again:
face_descriptions_long %>%
  cSplit("description", sep = "/", direction = "long")
However, people also used other separators like ‘;’ or ‘,’. Can you extend the code to account for these separators too?
Hint: you could call separate multiple times, but note that the sep argument is interpreted as a regular expression, so a single call can match several separators at once; have a look at the extra and remove arguments in the help file.
Answer
separated_descriptions <- face_descriptions_long %>%
  separate(description, c("d1","d2","d3"),
           sep = "[/;,]",    # sep is a regular expression, so one call matches all three separators
           extra = "merge")
# Note: calling separate() three times on the same column would not work here,
# because each call would overwrite the d1-d3 columns created by the previous one.
# separate() will again warn that missing pieces are filled with `NA` for
# anyone who gave fewer than three answers.
separated_descriptions %>% filter(user_id == 509665 | user_id == 509196)
We’ve now split the data, and have three description variables. But is this data ACTUALLY tidy? Isn’t one of the rules that we need to have only one column per variable? And now we have more than one description variables…
What do we need to do here so our description variables follow the rules of tidy data?
Hint: use the pivot_longer() function to create a new column which identifies the description number (d1, d2, d3), and a single variable called description which contains the actual description. Save it all in a new tibble called all_descriptions.
Answer
all_descriptions <- separated_descriptions %>%
  pivot_longer(cols = d1:d3,
               names_to = "description_number",
               values_to = "description")
all_descriptions
But, wait… another problem arises… not everyone provided three descriptions! So, if you look at the data, we have some missing values (NAs). We also have some nonsense descriptions (e.g. “f” and “.”). Let’s now sort these out.
Use the filter function to get rid of NAs and useless descriptions. Hint: the is.na() function will help. If you want to keep everything that is not equal to NA, what would you need to do? If you wanted to make sure you kept only rows where the description variable has more than one character, is there a function you could use? This is a task extension - you haven’t used this function before, but try googling for a function that counts the number of characters in a string. You can then use a logical operator (which we also learned about this week) to keep only descriptions with more than one character.
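Here is a small illustration on a toy vector (not the real data) of the two checks the hint describes: is.na() flags missing values, and nchar() counts characters, so single-character "descriptions" can be filtered out.

```r
# Toy vector of descriptions, including a missing value and junk entries
x <- c("big eyes", NA, "f", ".")

!is.na(x)       # which entries are not missing?
nchar(x) > 1    # which entries have more than one character?

# Combine both conditions with a logical AND
keep <- !is.na(x) & nchar(x) > 1
x[keep]         # only "big eyes" survives
```

filter() applies exactly this kind of logical test to every row of a tibble.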
Answer
tidy_descriptions <- all_descriptions %>%
  filter(!is.na(description),
         nchar(description) > 1)
tidy_descriptions
This data is starting to look better! We have taken some not so tidy data, and moved all the way through to tidy data!
Now we can actually find something out about our data. Let’s find out what the most common description of each face is. Earlier in the week you learnt how to pipe functions together to create summary stats.
Group the data by description, and summarise the data by generating a count for each group.
Answer
# group the data
grouped_faces <- tidy_descriptions %>%
  group_by(description)
head(grouped_faces)
# summarise and generate a count for each
sum_faces <- summarise(grouped_faces, n = n())
sum_faces
Finally, arrange the output so that we have the most common description at the top, and the least common at the bottom.
Hint: do you need ascending or descending order for this?
Create a tibble called top_10_descriptions, which filters your arranged data so that it only takes the top 10 answers.
This will help us answer the question: what are the most common descriptions of faces given?
Answer
top_10_descriptions <- sum_faces %>%
  arrange(desc(n)) %>%
  filter(row_number() < 11)  # could also use head(10) or slice(1:10)
top_10_descriptions
Congrats! You have successfully gone from messy data to summary stats - exactly what you’ve been learning to do over the past couple of days. Well done.
So from that messy dataset, we now have a nice summary table of the 10 most common descriptions of faces. And we did it quickly! One of the most useful things we learnt this week was how to chain functions together with the pipe. Try your hand at combining all of the code above into a single pipeline. Start from the very beginning, where you load in the data. Save it all in a tibble called faces.
Answer
# pipe it
faces <- face_descriptions %>%
  pivot_longer(cols = starts_with("t"),
               names_to = "face_id",
               values_to = "description") %>%
  separate(description, c("d1","d2","d3"),
           sep = "[/;,]",    # one regular expression handles all three separators
           extra = "merge") %>%
  pivot_longer(cols = d1:d3,
               names_to = "description_number",
               values_to = "description") %>%
  filter(!is.na(description),
         nchar(description) > 1) %>%
  mutate(description = trimws(description),
         description = tolower(description)) %>%
  group_by(description) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  filter(row_number() < 11)
# separate() will warn that missing pieces are filled with `NA`
Now we have reshaped our data, tidied it, and summarised it, and we’ve even made our code more concise by turning it into a single pipeline! The output table is nice, but it would be better if we could present the results in a graph rather than a table. If you’re finished, you can take on this final piece of analysis.
Google and see if you can find out how to use ggplot. Make a bar chart which shows the number of responses of your top 10 descriptions. Make sure you give it appropriate x and y labels, and an informative title. Be as creative as you like.
Answer
faces %>%
  mutate(description = factor(description, levels = description)) %>%  # keep the arranged order
  ggplot(aes(description, n, fill = description)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_brewer(palette = "Spectral") +
  labs(x = "Description",
       y = "Number of responses",
       title = "Top 10 descriptors for eye region expressions")